CNN-based noise reduction for multi-channel speech enhancement system with discrete wavelet transform (DWT) preprocessing

Multi-channel speech enhancement (MCSE) systems apply speech enhancement algorithms at multiple levels to improve the quality of speech signals in noisy environments. Numerous existing algorithms are used to filter noise in speech enhancement systems, typically as a pre-processor to reduce noise and improve speech quality. They may, however, perform poorly under low signal-to-noise ratio (SNR) conditions. Speech devices are exposed to all kinds of environmental noise, which may reach high noise levels. The objective of this research is to conduct a noise reduction experiment for an MCSE system in stationary and non-stationary environmental noise with varying speech signal SNR levels. The experiments examined the performance of the existing and the proposed MCSE systems in filtering environmental noises at low to high SNRs (−10 dB to 20 dB). The experiments were conducted using the AURORA and LibriSpeech datasets, which consist of different types of environmental noises. The existing MCSE (BAV-MCSE) makes use of beamforming, adaptive noise reduction and voice activity detection algorithms (BAV) to filter the noises from speech signals. The proposed MCSE (DWT-CNN-MCSE) system was developed based on discrete wavelet transform (DWT) preprocessing and a convolutional neural network (CNN) for denoising the input noisy speech signals to improve the performance accuracy. The performance of the existing BAV-MCSE and the proposed DWT-CNN-MCSE was measured using spectrogram analysis and word recognition rate (WRR). The existing BAV-MCSE reported the highest WRR at 93.77% for a high SNR (at 20 dB) and 5.64% on average for a low SNR (at −10 dB) for different noises. The proposed DWT-CNN-MCSE system performed well at a low SNR, with a WRR of 70.55% and the highest improvement (64.91% WRR) at −10 dB SNR.


INTRODUCTION
A speech enhancement system reduces background disturbances and noise while preserving speech features when dealing with noisy speech signals. Speech enhancement distinguishes between the intended speech and background noise interference (Wang & Chen, 2018). It aims to enhance speech quality to optimize associated signal processing systems, such as wearables (Takada, Seki & Toda, 2018), automatic speech recognition (Donahue, Li & Prabhavalkar, 2018), mobile telephony (Hasannezhad et al., 2021), and hearing prostheses (Syed, Trinh & Mandel, 2018). Many algorithms have been proposed, and the noise issue has been studied extensively for a very long period (Das et al., 2020). Spectral subtraction algorithms (Balaji et al., 2020), Wiener filtering (Yang & Bao, 2018), and nonnegative matrix factorization (Xu et al., 2021) are examples of traditional speech enhancement algorithms. However, only some of these algorithms pay attention to speech enhancement at low signal-to-noise ratio (SNR) conditions, which is more important and challenging than at high SNR conditions. Generally, SNR levels of −10 dB to 0 dB are regarded as low, while 5 dB to 15 dB are regarded as high (Wang & Chen, 2018). Many communication scenarios involve low SNR conditions, for instance, walkie-talkies used by employees in metal-cutting factories and wireless headsets used by mechanics when testing a helicopter. The current focus of research is on improving the performance of communication devices and services such as microphones, automatic speech recognition (ASR), Voice over Internet Protocol (VoIP), and teleconferencing.
While speech enhancement at high SNR makes the speech more comfortable for listeners, speech enhancement at low SNR affects the clarity of the speech. The improvement of speech at low SNR is therefore more significant than at high SNR (Xu et al., 2004). Dealing with a high noise level in a noisy environment and providing noise-free communication is a trending research topic in this field. Several algorithms, such as spectral subtraction, beamforming, adaptive noise reduction, and spectral statistical filters, among others, have been proposed to improve speech quality.
Multi-channel speech enhancement (MCSE) refers to systems that make use of multiple signal inputs, use noise references in adaptive noise cancellation, apply phase adjustment to cancel unwanted noise components, and combine step-by-step schemes (Kokkinakis & Loizou, 2010). The existing MCSE provides speech recognition at a 71% Word Recognition Rate (WRR) at 10 dB SNR compared to a single microphone (Xu et al., 2004; Stupakov et al., 2012). These multi-channel algorithms (beamforming, adaptive noise reduction and voice activity detection algorithms) suffer from a low recognition rate when the SNR is low (−15 dB, −10 dB, −5 dB, 0 dB) (Pauline, Samiappan & Kumar, 2021; Kim, 2020). The existing algorithms developed for MCSE systems were only tested for white Gaussian stationary noise at 0 to 60 dB SNRs and were never tested for non-stationary environmental noises.
Deep learning is one of the state-of-the-art approaches in the speech enhancement domain (Rownicka, Bell & Renals, 2020; Ochiai, Delcroix & Nakatani, 2020) and has been proven to have acceptable performance in handling different levels of noise in speech enhancement, depending on the computing platform. Among the deep learning algorithms, the very deep convolutional neural network (VDCNN-conv) reported the highest WRR at 90.45% and an average WRR of 87.45% for environmental noises (Cherukuru, Mumtaz Begum & Hema, 2021). However, deep learning algorithms have never been tested in MCSE systems. As such, the aim of this research is to propose an MCSE system using deep learning and preprocessing algorithms and to examine the performance of the proposed system against the existing MCSE system in filtering environmental noises at low to high SNR conditions.
The rest of the article is structured as follows. The 'Research background' section provides an overview of single-channel and multi-channel speech enhancement and their limitations, including the existing MCSE and deep learning algorithms. The next section describes the proposed deep learning-based MCSE system. The 'Methodology' section describes the approach, experimental design, setup, and evaluation methods used for both the benchmark MCSE system and the proposed MCSE approach. The 'Results' section presents the findings of this research, while discussions are presented in the next section. Finally, the last section concludes the proposed work.

RESEARCH BACKGROUND
Many existing algorithms were used to filter noise in MCSE systems and are often used as a pre-processor to improve speech quality. They have proven to be effective in reducing interference signals and improving voice quality. There are two categories of speech enhancement systems: single-channel and multi-channel speech enhancement.

Single channel speech enhancement system
The approaches for enhancing speech with only one acquisition channel are known as ''single channel'' algorithms. A single channel is typically not available in most real-time applications such as speaker recognition, voice recognition, mobile communications, and hearing aids, though single-channel systems are relatively cheaper than multi-channel systems. This is one of the most difficult situations in the speech enhancement domain, as there is no reference signal available for the noise, and the clean speech/audio signal cannot be preprocessed before it is affected by the noise (Yadava & Jayanna, 2019; Hossain et al., 2023; Xu, Tu & Yang, 2023). Despite the challenges, several algorithms have been developed and tested, such as subtraction algorithms, over-subtraction algorithms, non-linear spectral subtraction, and non-linear weighted subtraction. These algorithms improved speech quality in noisy environments; however, they are computationally intensive and not effective at suppressing noise in audio signals, especially when the SNR is low, i.e., −10 dB to 10 dB (Shanmugapriya & Chandra, 2014; Upadhyay & Karmakar, 2015; Saleem et al., 2022). Environmental noise is difficult to filter because its characteristics, such as noise level in decibels and frequency content, depend on the type of environment. Therefore, MCSE is very much required (Akhaee, Ameri & Marvasti, 2005).
Recently, researchers have given attention to the convolutional neural network (CNN) algorithm for single-channel speech enhancement systems. The performance of CNN was measured using various metrics such as mean opinion score (MOS), signal distortion (SIG), and intrusiveness of background noise (BAK). A scale from 1 to 5 is used for SIG, BAK, and MOS, with a higher number being preferred.
While word error rate (WER) or word recognition rate (WRR) is a common metric used specifically to assess the performance of ASR systems, other common objective measures include segmental signal-to-noise ratio (segSNR), distance measures, source-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI).

Multi-channel speech enhancement system (MCSE)
Microphone arrays and speech enhancement components are built into an MCSE system, which processes multiple channels of audio signals in noisy settings such as outdoor environments (Palla et al., 2017; Pauline, Samiappan & Kumar, 2021). For example, a spectral statistics filter is applied to hearing aids to handle stationary noise environments (Gaussian noise) and unsteady noise environments (factory, babble, and car noises) from −5 dB to 20 dB (Kim, 2020). At low SNR, the reported PESQ scores are 2.16 with babble noise, 2.20 (low signal quality) with Gaussian noise, 2.13 (low signal quality) with factory noise, and 3.67 (medium signal quality) with car noise, averaged over SNR levels of −5 dB to 10 dB (Kim, 2020).
Figure 1 shows the architecture of the existing MCSE system based on beamforming, adaptive noise reduction, and voice activity detection (BAV-MCSE). The architecture consists of a microphone array, beamforming, adaptive noise reduction, and voice activity detection.

• Beamformer
A microphone array can be combined with a spatial filtering signal processor called a beamformer. Beamforming is achieved by filtering the microphone signals, merging the outputs to obtain the desired signal, and filtering away interference noise (Van Veen & Buckley, 1988). Fixed beamforming and adaptive beamforming are the two types of beamforming. In fixed beamforming, the direction of the input signal is fixed and the distance between the microphones is constant. Fixed beamforming is achievable with delay-and-sum beamformers. In adaptive beamforming, the directivity of the input noisy speech signal varies as the acoustic environment changes (Ramesh Babu & Sridhar, 2021).
• Adaptive Noise Reduction (ANR)
Adaptive noise reduction (ANR) uses a least mean square (LMS) filter to filter environmental noise (Soo & Pang, 1990; Valin, 2007). The ANR is fed by the user beam and the reference noise (b1...bn). The ANR component removes from the user beam the noise that is correlated with the reference noise. However, the speech signal, which is present only in the user beam after beamforming, is not attenuated (Widrow et al., 1975).
• Voice Activity Detection algorithm (VAD)
VAD distinguishes the user's voice in the user stream (Karita et al., 2019), which is important for two reasons. The first is segmentation, where the system identifies the precise boundaries of each word in spoken utterances. The second is data reduction, where the system only sends data when it is needed, rather than continuously transmitting data through the transmission channel.
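The LMS-based noise reduction described in the list above can be illustrated with a minimal sketch. It assumes a single reference noise channel, a hypothetical two-tap acoustic path from the noise source into the user beam, and illustrative values for the filter length and step size; none of these values come from the BAV-MCSE implementation.

```python
import math
import random


def lms_noise_canceller(primary, reference, num_taps=8, mu=0.01):
    """Subtract an adaptive estimate of the noise from the primary (user beam) signal."""
    weights = [0.0] * num_taps
    window = [0.0] * num_taps  # most recent reference samples, newest first
    enhanced = []
    for desired, ref_sample in zip(primary, reference):
        window = [ref_sample] + window[:-1]
        noise_estimate = sum(w * r for w, r in zip(weights, window))
        error = desired - noise_estimate  # error signal doubles as the enhanced output
        weights = [w + 2 * mu * error * r for w, r in zip(weights, window)]
        enhanced.append(error)
    return enhanced


def mean_square_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
num_samples = 4000
# Toy "speech": a 200 Hz tone sampled at 16 kHz (illustrative only).
speech = [0.5 * math.sin(2 * math.pi * 200 * t / 16000) for t in range(num_samples)]
noise = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
# Hypothetical path: the noise leaks into the user beam through two taps.
primary = [
    s + 0.8 * noise[t] + (0.3 * noise[t - 1] if t > 0 else 0.0)
    for t, s in enumerate(speech)
]
enhanced = lms_noise_canceller(primary, noise)

# Compare residual noise power after the filter has had time to converge.
noise_power_before = mean_square_error(primary[2000:], speech[2000:])
noise_power_after = mean_square_error(enhanced[2000:], speech[2000:])
print(noise_power_before, noise_power_after)
```

Because the reference is correlated with the noise but not with the speech, the filter converges toward the noise path and leaves the speech component largely untouched.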
According to Cherukuru, Mumtaz Begum & Hema (2021), the beamforming, ANR and VAD algorithms in MCSE were tested only with white Gaussian stationary noise between 0 dB and 60 dB SNR, showing that MCSE is particularly effective at 20 dB SNR (Palla et al., 2017). For stationary white Gaussian noise, the word recognition rate decreased as the SNR dropped from 15 dB to −10 dB (Cherukuru, Mumtaz Begum & Hema, 2021). The WRR was lower at negative SNRs than at positive SNRs, and it was also low for non-stationary noise from 15 dB to −10 dB. At low SNR conditions, the existing MCSE's WRR was poor. It did, however, perform better in a noisy stationary environment than in a noisy non-stationary environment.

Deep learning-based multi-channel speech enhancement system
Deep neural networks (DNNs), the first technology introduced for guided speech improvement, have grown in prominence in recent years (Venkatesha Prasad et al., 2002). The DNN, also known as the feed-forward fully connected network or multilayer perceptron (MLP) with numerous hidden layers, is one of the most used designs for speech enhancement (Zhao et al., 2018). The network is characterised as fully connected because every node in a layer has a link with every node in the preceding layer. As a result, a DNN contains many parameters. Karjol, Kumar & Ghosh (2018) introduced an enhancement strategy based on n DNNs, each of which contributes to the final enhanced speech, and a gating network that provides weights to combine the DNN outputs. Standard subjective and objective measures can be used to compare the performance of speech enhancement systems. The model employs n = 4, with each DNN consisting of three layers. Averaged over SNRs of −5 dB to 10 dB on the TIMIT corpus, it yields a PESQ of 2.65 for seen noise and 2.19 for unseen noise.
Although DNN has been used successfully as a regression model for speech enhancement, its enhanced speech frequently degrades in low SNR conditions (Gao et al., 2016). To enhance the effectiveness of DNN-based speech enhancement in low SNR environments, some scholars presented a progressive learning architecture with a long short-term memory (LSTM) network (Gao et al., 2018; Santhanavijayan, Kumar & Deepak, 2021). Each target layer is built so that the transition speech with a higher SNR is learned at the final layer, followed by clean speech. Additionally, LSTM-RNN has been used to address reverberation (Weninger et al., 2013), loud multichannel speech and extremely non-stationary additive noise (Wollmer et al., 2013). In Wollmer et al. (2013), bottleneck features produced by the bi-directional LSTM network (BiLSTM) outperformed manually created features like MFCC. When employing MFCC, the average word accuracy (WA) is 38.13%, whereas when using batch-normalized long short-term memory (BN-BLSTM), it is 43.55%. The LSTM-RNN has significantly enhanced speech processing systems. However, it is well known that learning the RNN parameters is challenging and time-consuming (Weninger et al., 2013; Wollmer et al., 2013).
This research proposed a noise-reduction framework using pre-processing and deep learning algorithms to overcome the noise issue in the MCSE system. Based on Katti & Anusuya (2011), Labied & Belangour (2021), and Ping, Li-Zhen & Dong-Feng (2009), the discrete wavelet transform (DWT) preprocessing algorithm and the CNN algorithm are suitable for filtering noisy environments and improving the quality of speech. According to Labied & Belangour (2021), DWT is effective in denoising the speech signal and can compress the speech signal without degrading the speech quality. Katti & Anusuya (2011) and Ping, Li-Zhen & Dong-Feng (2009) state that CNN has the capacity to detect patterns in neighboring speech structures and that, compared to RNN and standard DNN, CNN is more effective at filtering high levels of noise in speech signals. However, it is unable to maintain invariance when the input data changes. Among all the deep learning algorithms, CNN reported the highest WRR at 90.45% and an average WRR of 87.45% for environmental noises.
Table 1 summarizes the deep learning-based algorithms for single and multi-channel speech enhancement. CNN with preprocessing algorithms has yet to be tested in the MCSE system to improve speech quality, mainly under low SNR conditions.

THE PROPOSED CNN-BASED NOISE REDUCTION FOR MULTI-CHANNEL SPEECH ENHANCEMENT SYSTEM WITH DWT PRE-PROCESSING
This section presents the proposed architecture for an MCSE system that combines DWT preprocessing with a CNN-based deep learning algorithm (DWT-CNN-MCSE). The DWT algorithm is used as a pre-processing technique to remove selected noise by decomposing the noisy speech signals. The CNN algorithm is used to handle feature extraction and classification. The proposed architecture is depicted in Fig. 2.

Discrete wavelet transform (DWT) preprocessing
This research adopted DWT among other algorithms as its performance is very effective in terms of denoising the speech signal and compressing the speech signal without any significant loss in speech quality (Katti & Anusuya, 2011;Labied & Belangour, 2021;Ping, Li-Zhen & Dong-Feng, 2009).
This algorithm aims to create the signal's detailed information by rescaling and iterating through a series of filters. Filtering operations determine the signal's resolution (the amount of detail information), whereas up-sampling and down-sampling (subsampling) operations determine its scale. As there is a lack of preprocessing algorithms implemented on MCSE systems, this article used the existing DWT preprocessing algorithm to remove the redundant data from noisy speech signals.
This research implements a discrete wavelet-based algorithm for the signals obtained through MEMS microphones. Algorithm 1 explains the step-by-step procedure of DWT applied in this research. Computing the wavelet series, which is a sampled form of the continuous wavelet transform (CWT), may take a significant amount of time and resources. There is evidence that the sub-band coding-based DWT is more efficient in computing wavelet transforms. It is simple to implement, and it decreases the time and resources needed for computation. In DWT, digital filtering algorithms are used to obtain a time-scale depiction of the digital signal. Filters with various cutoff frequencies and scales are used to evaluate the input signal.
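One level of sub-band coding can be sketched as follows. This is a minimal illustration that assumes Haar filters and periodic signal extension; the wavelet family and boundary handling used in this research are not stated in the text, and the function name `dwt_one_level` is hypothetical.

```python
import math


def dwt_one_level(xdata, scale_filter, wavelet_filter):
    """One level of sub-band coding: filter, then keep every second output sample.

    The first half of the result holds approximation (low-pass) coefficients,
    the second half holds detail (high-pass) coefficients.
    """
    n = len(xdata)
    if n % 2 != 0:
        raise ValueError("the signal length must be even")
    half = n // 2
    ydata = [0.0] * n
    for i in range(half):
        for k in range(len(scale_filter)):
            sample = xdata[(2 * i + k) % n]  # periodic extension at the boundary
            ydata[i] += scale_filter[k] * sample
            ydata[half + i] += wavelet_filter[k] * sample
    return ydata


# Haar filters (an assumption made for this sketch).
s = 1 / math.sqrt(2)
HAAR_SCALE = [s, s]
HAAR_WAVELET = [s, -s]

coeffs = dwt_one_level([1.0, 1.0, 2.0, 2.0], HAAR_SCALE, HAAR_WAVELET)
print(coeffs)
```

For this piecewise-constant input, the detail coefficients are zero and all of the signal energy sits in the approximation half, which is why thresholding the detail half can remove noise without touching slowly varying content.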

Advantages: If GAN is correctly trained, its combined networks can be very strong.
Disadvantages: The adversarial training is typically challenging and unstable.

Xdata[] stores the input data vector, and Ydata[] is the output data vector that is returned. N is the length of both data vectors. Before applying this approach, it is presumed that the wavelet filter parameters G[k] and the scale filter parameters H[k] have been provided. L is the total number of parameters. N must be an even number to work with this algorithm.
Step 10:
Step 11:
Step 12: endfor;
Step 13: j = j + 1; // Update position in output array
Step 14: endfor;

knowledge of the speech signal, they can take advantage of the long-term dependencies between the speech frames. However, CNNs in speech communication systems cannot handle many semi-clean data, resulting in reduced performance. To overcome these issues, bidirectional long short-term memory (BiLSTM), which regulates the flow of information by an individual component called a memory block, was developed by Weninger et al. (2013).
The fundamental purpose of CNN is to detect local structure in input data. The spectral correlations in acoustic features are well modeled by CNN, which successfully decreases the spectral fluctuations. Three distinct components, namely CNN, BiLSTM, and fully connected layers, are included in the suggested architecture, as illustrated in Fig. 3.
Convolutional layers are first used to reduce the frequency variance in the input signal. Two CNN layers with 256 feature maps in each convolutional layer were chosen, because speech has a small feature dimension (i.e., 40). The behavior of the high- and low-frequency regions is vastly different. Nearly 16% of the feature map's original size has been decreased using two convolutional layers. As a result, modeling locality and eliminating invariance are no longer necessary. Following Sainath et al. (2013), the first convolutional layer has a 9 by 9 frequency-time filter, while the second layer has a 4 by 3 frequency-time filter. Initially, our model employs only frequency-domain pooling using max pooling. The pooling size is 2 for both layers, and the stride value is 2. The next layer in CNN has a greater dimension since the set of feature maps, time, and frequency is proportional to the layer's size. Therefore, the feature dimensions must be reduced. After the CNN layers, a linear layer is applied to reduce the layer's size without sacrificing accuracy.
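How the frequency dimension shrinks through these layers can be traced with a short calculation. The sketch assumes unpadded ("valid") convolutions with stride 1 along frequency and floor rounding in pooling; the text does not state the padding scheme, so the resulting sizes are illustrative only.

```python
def output_size(input_size, kernel, stride=1):
    """Length of one axis after an unpadded convolution or pooling step."""
    return (input_size - kernel) // stride + 1


# Frequency-axis sizes, starting from the 40-dimensional input features.
sizes = [40]
sizes.append(output_size(sizes[-1], kernel=9))            # first convolution, 9 along frequency
sizes.append(output_size(sizes[-1], kernel=2, stride=2))  # max pooling, size 2, stride 2
sizes.append(output_size(sizes[-1], kernel=4))            # second convolution, 4 along frequency
sizes.append(output_size(sizes[-1], kernel=2, stride=2))  # max pooling, size 2, stride 2
print(sizes)
```

Under these assumptions the frequency axis goes from 40 to 6 bins per feature map, which shows why a linear layer is still useful afterwards: with 256 feature maps the flattened output remains large.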
Algorithm 2 shows the processes involved in CNN through different layers. Frequency modeling is an algorithm for reducing the data dimensionality of 236 suitable outputs by using linear layers. To model the signal in time, the output of the CNN layer is passed to the BiLSTM layer. In this case, two BiLSTM and three FC layers would be ideal. However, the number of layers can vary depending on the experiment. Each BiLSTM layer has 832 cells and a 512-unit projection layer (256 LSTM units per direction) for feature extraction. The BiLSTM is trained over 20 time steps with truncated backpropagation. The output of the BiLSTM layers is sent to the FC layers after frequency and timing modeling. These layers generate higher-order feature representations that are easily distinguishable between classes. All fully connected layers have 1024 hidden units.
Speech variation resulting from accent, volume, and other characteristics can distinguish distinct speakers. The proposed approach uses shared weights obtained by applying several convolution operations. These convolutions generate features that are supplied to the max pooling layer. The shared weights mechanism helps retain the top-level and low-level attributes as well as improve the accuracy in terms of WRR. Further, these attributes are processed through the linear layer, which supplies these features to the CNN-BiLSTM layer (Sermanet, Chintala & LeCun, 2012; Passricha & Aggarwal, 2019). In most CNN work, FC layers discriminate between classes based on local knowledge. The CNN-BiLSTM module is used for energy and timing modeling, and the softmax layer is utilized to distinguish between different classes. The entire model is trained at the same time.

Output: enhanced speech signal with recognition rate performance.
Step 1: Capture speech signals by using the DMA microphone array.
Step 2: Apply an analogue-to-digital converter to convert the analogue signal into a digital signal.
Step 3: Apply wavelet transform by applying
Step 5: Obtain the final output speech data and measure the performance.

RESEARCH METHOD
This research aims to examine the performance of the proposed MCSE system in filtering stationary and non-stationary environmental noises at low to high SNRs (−10 dB to 20 dB). The experiments compare the ability of the proposed DWT-CNN-MCSE system against the existing BAV-MCSE in filtering environmental noise at low SNR conditions. The existing and the proposed MCSE systems were evaluated in terms of spectrogram analysis and WRR.

Experimental design
The experimental design of this research was based on the researchers' previous work (Cherukuru, Mumtaz Begum & Hema, 2021), which examined environmental noises at different SNR levels to determine the limitations of the existing algorithms in handling environmental noises. From that work, we found that the existing MCSE shows an acceptable recognition rate at high SNR levels but not at low SNR levels. To overcome the problem of low recognition rate

Speech dataset
The AURORA and the LibriSpeech datasets were used to train the deep learning models and to test the MCSE systems (the benchmark and the proposed systems) in noisy environments at various SNR levels of voice signals.
• AURORA
There were 13 distinct male voices and 16 distinct female voices among the 25 utterances taken from the AURORA noisy sample (Karjol, Kumar & Ghosh, 2018). Even though the number of trials changes based on the noise level, at least 25 samples were taken for each dB level.
Five types of non-stationary environmental noise are represented in the noisy speech utterances: airport, babble, exhibition, car, and restaurant noises. One stationary noise type, white Gaussian noise, was examined at seven different SNRs: −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. Other noise types that were investigated included babble, car, exhibition, and restaurant noises, while −10 dB was for loud speech signals. A total of 25 utterances from the AURORA clean training dataset were selected, and −10 dB noise signals were purposefully mixed with them. For every 25 utterances, 42 different conditions were prepared.
In the proposed experiment, the AURORA database, which is taken from the internationally recognized NOIZEUS database for the evaluation of speech enhancement algorithms, was used. This database includes the speech recordings of speakers, three men, (Loizou, 2009). To achieve the appropriate SNR levels, the recovered noise segments were artificially added to the clean speech signal. The entire dataset was split into two sets: the training dataset and the testing dataset. 20% of the dataset was used for testing, while the remaining 80% was used to train the CNN algorithm (Gholamy, Kreinovich & Kosheleva, 2018).
• LibriSpeech
The LibriSpeech noisy dataset consists of a single male voice with 6 different conditions under 3 different noises (Panayotov et al., 2015; Park et al., 2020). Three types of noise (dog bark, ambulance and fan noises) were examined at seven different SNRs: −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. One clean speech utterance from LibriSpeech was selected for training the CNN algorithm, and 18 different conditions of this utterance were prepared with different noises at different SNR levels. This entire dataset was used for both testing and training: 20% of the dataset was used to test the CNN algorithm and 80% was used to train it. We also used this dataset to evaluate the performance accuracy of the proposed MCSE system in terms of spectrogram analysis and word recognition rate.
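The 80/20 partition used for both datasets can be sketched as below. The text does not state whether the split is made over utterances, conditions or frames, so this sketch assumes a shuffled split over the 18 LibriSpeech conditions; the condition labels and the seed are placeholders.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items reproducibly and hold out a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    num_test = round(len(shuffled) * test_fraction)
    return shuffled[num_test:], shuffled[:num_test]


# Placeholder labels for the 18 noisy conditions of the single utterance.
conditions = [f"condition_{index:02d}" for index in range(18)]
train_set, test_set = train_test_split(conditions)
print(len(train_set), len(test_set))
```

With 18 items, a 20% hold-out rounds to 4 test conditions and 14 training conditions.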
The details of the AURORA and LibriSpeech datasets used in the experiments are presented in Table 3.
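The count of 42 AURORA conditions is consistent with six noise types crossed with seven SNR levels. The sketch below enumerates that grid under this assumption; the text does not spell out how the 42 conditions were composed.

```python
from itertools import product

# One stationary and five non-stationary noise types named in the text.
NOISE_TYPES = ["white Gaussian", "airport", "babble", "car", "exhibition", "restaurant"]
SNR_LEVELS_DB = [-10, -5, 0, 5, 10, 15, 20]

conditions = list(product(NOISE_TYPES, SNR_LEVELS_DB))
print(len(conditions))
```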

The experimental setup
This research conducted two types of experiments, namely (1) the benchmark MCSE based on beamforming, ANR and VAD algorithms (BAV-MCSE) and (2) the proposed MCSE based on a deep learning algorithm with DWT pre-processing (DWT-CNN-MCSE).

The benchmark MCSE system
In the benchmark experiment, the BAV-MCSE system is tested using the beamforming, ANR, and VAD algorithms (BAV). In this experiment, the MEMS microphone array captures the noisy speech signals and sends them to fixed beamforming, which separates the audio and noise beams. The output of the beamforming goes to ANR, which filters the noise based on the reference noise. Finally, VAD separates the voiced speech signals, and the result is evaluated with an ASR engine.
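The first and last stages of this chain can be illustrated with a minimal sketch of a delay-and-sum (fixed) beamformer followed by an energy-based VAD. It assumes integer sample delays that are already known, a two-microphone toy signal, and an arbitrary energy threshold; none of these values come from the benchmark system.

```python
import math


def delay_and_sum(channels, delays):
    """Advance each channel by its known integer delay and average the aligned samples."""
    length = min(len(channel) - delay for channel, delay in zip(channels, delays))
    return [
        sum(channel[t + delay] for channel, delay in zip(channels, delays)) / len(channels)
        for t in range(length)
    ]


def energy_vad(signal, frame_len=160, threshold=0.01):
    """Flag a frame as voiced when its mean energy exceeds the threshold."""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags


# Toy source: 800 samples of silence, then a 200 Hz tone at a 16 kHz sampling rate.
source = [0.0] * 800 + [0.5 * math.sin(2 * math.pi * 200 * t / 16000) for t in range(800)]
mic_a = source
mic_b = [0.0] * 3 + source  # the second microphone hears the source 3 samples later

beam = delay_and_sum([mic_a, mic_b], delays=[0, 3])
voiced = energy_vad(beam)
print(voiced)
```

After alignment the two channels add coherently, so the beam reproduces the source, and the VAD marks only the frames that contain the tone.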
The experimental setup is as follows:
• Device configuration
The device configuration is based on the researchers' previous work.
• Sampling setup
For sampling, we used the pre-amplified output of two transducers, which was passed to a single-stage bandpass filter (80 Hz-16 kHz), gain adjusted, level shifted to 1.75 V, and then fed to individual ADCs (analog-to-digital converters). For this research, the ADCs were configured at 12-bit vertical resolution and 16,000 samples per second (+/-50). Data is saved to the SD card using a conversion-complete interrupt linked to the DMA channel that writes to the SD card and a buffer variable defined in RAM, where both sampling times were synchronized. The amplifier used was an LM358 general-purpose op-amp.
• Variability setup
Due to the increased ARM deflection of the servos, the Timer 1 PWM channels were used to connect two 9G servos with 16-bit resolution. There is a 10 mm gap between each microphone.
• Setup of the noise and sample utterance system
Edifier 2.0, a channel speaker, served as the main noise maker. The noise samples were continuously looped and transferred from the BeagleBone Black single-board computer to the amplifier, where the speech is altered. Only the left channel of the Logitech USB speakers was used to enter the samples into the BeagleBone Black single-board computer.
• SNR setup
The required SNR (−10 dB, −5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB) was derived by adjusting the noise sound amplifier gain and the sample utterance amplifier gain control.
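The relation between gain and SNR can be worked out digitally as well. The sketch below scales a noise signal so that the mixture reaches a target SNR, using SNR (dB) = 10 log10(P_speech / P_noise); the function names and the toy signals are illustrative and are not part of the hardware setup described above.

```python
import math
import random


def power(signal):
    """Mean power of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that clean + noise has the requested SNR in dB."""
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR of a mixture, given the clean signal it was built from."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


rng = random.Random(1)
clean = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

for target_db in (-10, -5, 0, 5, 10, 15, 20):
    noisy = mix_at_snr(clean, noise, target_db)
    print(target_db, round(measured_snr_db(clean, noisy), 6))
```

Each 5 dB step in SNR corresponds to changing the noise amplitude by a factor of about 1.78 relative to the speech.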

The proposed MCSE system
In this experiment, the proposed DWT-CNN-MCSE system is tested using DWT as a pre-processing technique and a CNN-based deep learning technique.
The proposed algorithm is implemented using MATLAB 2021a. This tool is widely adopted for various signal-processing tasks such as image processing, speech processing, and ECG signal processing. The DWT preprocessing algorithm is used to create the signal's detailed information by rescaling and iterating through a series of filters. Filtering operations determine the signal's resolution, whereas up-sampling and down-sampling (subsampling) operations determine its scale. In simple terms, DWT decomposes the signal into different frequency bands. It effectively denoises the speech signal and compresses the speech signal without any significant loss in speech quality. The output of the DWT goes to the CNN, in which data is processed through multiple layers, namely the convolution, max pooling, linear, BiLSTM, fully connected and softmax layers, to learn its attributes and improve the recognition accuracy.
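A common way to denoise with a wavelet decomposition is to shrink the detail coefficients and reconstruct the signal. The sketch below assumes a single-level Haar transform and soft thresholding with an arbitrary threshold; the text does not say which wavelet, decomposition depth or thresholding rule the proposed system uses, so this is an illustration of the general idea rather than the DWT-CNN-MCSE preprocessing itself.

```python
import math
import random

S = 1 / math.sqrt(2)


def haar_forward(x):
    """Split an even-length signal into approximation and detail coefficients."""
    approx = [S * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    detail = [S * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the signal from its approximation and detail coefficients."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([S * (a + d), S * (a - d)])
    return x


def soft_threshold(values, threshold):
    """Shrink each coefficient toward zero by the threshold."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in values]


def denoise(x, threshold):
    approx, detail = haar_forward(x)
    return haar_inverse(approx, soft_threshold(detail, threshold))


def mean_square_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(2)
clean = [0.5 * math.sin(2 * math.pi * 100 * t / 16000) for t in range(1024)]
noisy = [c + rng.gauss(0.0, 0.1) for c in clean]
restored = denoise(noisy, threshold=0.3)
print(mean_square_error(noisy, clean), mean_square_error(restored, clean))
```

A single level only removes noise from the upper half of the band; deeper decompositions repeat the same step on the approximation coefficients.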
This research employed the following MATLAB toolboxes for speech-processing tasks:
• Audio toolbox: Audio Toolbox offers audio processing, speech analysis, and acoustic measurement tools. It provides algorithms to evaluate acoustic signal metrics and to train machine learning and deep learning models. Researchers can import, categorise, and enhance audio data sets using Audio Toolbox.
• Data acquisition toolbox: The Data Acquisition Toolbox includes programs and features for configuring data collection devices, reading data into MATLAB and Simulink, and publishing data to DAQ analogue and digital output channels.
• Digital signal processing toolbox: With the DSP System Toolbox, researchers can create and examine FIR, IIR, multi-rate, multistage, and adaptive filters.
• Wavelet toolbox: Wavelet Toolbox offers functions and applications to analyze and synthesize signals and images. Researchers can analyze signals and images at various resolutions using discrete wavelet analysis to find changepoints, discontinuities, and other events that are not immediately visible in raw data.
• Deep learning toolbox: With its methodologies, pre-trained models, and applications, the Deep Learning Toolbox was used for developing and integrating deep neural networks into applications.

Evaluation methods
This research evaluates the performance of a multi-channel speech enhancement system in a noisy environment (stationary and non-stationary noise) using spectrogram analysis and WRR.

• Spectrogram analysis
The amplitude of speech signals is analyzed using spectrogram analysis (Haykin et al., 1991). MATLAB is used for time-domain spectrogram analysis for both stationary noise (white Gaussian noise) and non-stationary environmental noises (babble, car, exhibition, airport, and restaurant) from the AURORA database, and ambulance noise from the LibriSpeech database. The spectrogram reflects the change in amplitude, frequency, wavelength and time at different SNR levels. In this experiment, we analyzed the spectrograms with the signal amplitude in the time domain.
• Word recognition rate (WRR)
The word recognition rate is used to assess the performance accuracy of multi-channel speech enhancement systems. The following formula is used to calculate WRR:
$$\mathrm{WRR} = \frac{N - S - D - I}{N} \times 100\%$$
where N is the total number of words or letters in the sentence, S is the number of substitutions, D is the number of deletions, and I is the number of insertions.
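The WRR computation can be sketched in Python as follows. This is a minimal sketch, assuming the common definition WRR = (N − S − D − I) / N × 100 and assuming that S, D and I are obtained from a minimum-edit-distance alignment between the reference and recognized word sequences. The function names and the example sentence are hypothetical and are not taken from the experiments.

```python
def count_errors(reference, hypothesis):
    """Return (S, D, I) from a minimum-edit-distance word alignment."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j] = (total errors, S, D, I) for reference[:i] vs hypothesis[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (t, s, d, ins)          # correct word
            else:
                diagonal = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = cost[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            # Tuples compare on total errors first, so min() keeps the
            # cheapest alignment.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, s, d, ins = cost[-1][-1]
    return s, d, ins


def word_recognition_rate(reference, hypothesis):
    """WRR in percent, assuming WRR = (N - S - D - I) / N * 100."""
    n = len(reference)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    s, d, ins = count_errors(reference, hypothesis)
    return 100.0 * (n - s - d - ins) / n


# Hypothetical example: one deleted word and one substituted word.
reference = "the birch canoe slid on the smooth planks".split()
hypothesis = "the birch canoe slid on smooth blanks".split()
wrr = word_recognition_rate(reference, hypothesis)
```

In this example N = 8, S = 1, D = 1 and I = 0, so the WRR is 75%.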

Spectrogram analysis
Figure 4 shows the spectrograms of clean speech, and Table 4 shows and compares the spectrograms of a sample utterance under six different conditions from the AURORA database. The DWT-CNN-MCSE provides substantial noise suppression compared to unprocessed speech, and the DWT-CNN algorithms eliminate almost all the noise across the entire frequency range.
This research has also experimented with the proposed DWT-CNN-MCSE using LibriSpeech. Table 5 shows the spectrograms of the original signal, the noisy speech signal and the enhanced speech signal under ambulance noise at the −10 dB SNR level. Compared to the original noisy speech signal under ambulance noise, the proposed MCSE offers significant noise reduction, and the DWT-CNN algorithms eliminate nearly all noise over the whole frequency band.

Word recognition rate (WRR)
Tables 6 and 7 present the results of the benchmark BAV-MCSE and the proposed DWT-CNN-MCSE tested using the AURORA dataset at different SNR levels under stationary and non-stationary noisy environments. Comparing the performance of the developed noise reduction system in filtering environmental noises at various SNRs yields the following observations.
For non-stationary noises, Fig. 5 demonstrates the variations in the WRR for both the BAV-MCSE and the proposed DWT-CNN-MCSE. In comparison to BAV-MCSE, the proposed MCSE is particularly good at recognizing speech in non-stationary noisy conditions. To determine whether the results for BAV-MCSE and the proposed DWT-CNN-MCSE differed significantly, we used Analysis of Variance (ANOVA), and the results are presented in Fig. 6. Fig. 6 shows that the scores of the proposed algorithm (DWT-CNN-MCSE) are significantly different from those of the existing algorithm (BAV-MCSE) under non-stationary environments, at a p-value less than 0.05.
For stationary noises, Fig. 7 demonstrates the changes in the WRR for both the BAV-MCSE and the proposed DWT-CNN-MCSE. In comparison to the existing BAV-MCSE, the proposed DWT-CNN-MCSE is good at recognizing speech in stationary noisy conditions. Again, to determine whether the results for BAV-MCSE and the proposed DWT-CNN-MCSE were significantly different, we used ANOVA, and the results are depicted in Fig. 8. Fig. 8 shows that the proposed DWT-CNN-MCSE scores are significantly different from those of the existing BAV-MCSE under stationary environments, at a p-value less than 0.05.
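The one-way ANOVA comparison described above can be sketched in Python as follows. This is a minimal sketch of the F-statistic computation only. The function name `one_way_anova_f` is hypothetical, and the two score lists are placeholder values chosen for easy arithmetic, not WRR results from Tables 6 and 7.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Placeholder scores for two systems (NOT values from this study).
system_a_scores = [1.0, 2.0, 3.0]
system_b_scores = [4.0, 5.0, 6.0]
f_stat, df_between, df_within = one_way_anova_f(
    [system_a_scores, system_b_scores])
```

The p-value then follows from the upper tail of the F distribution with (df_between, df_within) degrees of freedom; a statistics package such as SciPy (`scipy.stats.f_oneway`) returns both the statistic and the p-value directly.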
Table 8 presents the results of the proposed DWT-CNN-MCSE tested using the LibriSpeech dataset at different SNR levels for the ambulance, dog bark and fan noises. The spectrograms of the enhanced speech obtained with all processing methods are depicted in Tables 4 and 5. The spectrograms of BAV-MCSE have lost some important speech content, as parts of the speech are missing, and hence BAV-MCSE yields a lower speech recognition rate than DWT-CNN-MCSE, which is evident in Tables 4(c) and 5(c). The spectrogram of DWT-CNN-MCSE is a close replica of the clean speech spectrogram, and important speech content is effectively preserved. Also, low noise is observed in the spectrogram of the DWT-CNN-MCSE output speech. The time-domain waveforms of the enhanced speech utterances obtained with all the processing methods are also depicted in Tables 4 and 5. The waveforms of BAV-MCSE contain some noise, and hence BAV-MCSE yields a lower recognition rate than DWT-CNN-MCSE, which is evident in Tables 4(b), (c) and 5(b), (c). Low noise is observed in the waveform of the DWT-CNN-MCSE output speech.
The result of word recognition shows that the benchmark MCSE could not function adequately in low SNR settings. However, the benchmark MCSE performed better in a stationary noisy environment than in a non-stationary one. It was also discovered that the MCSE algorithms perform well in both stationary and non-stationary noisy environments at high SNR. The linear relationship between SNR and WRR shows that the MCSE successfully filters noise at higher SNR but not at lower SNR, where the noise is too strong relative to the speech for the MCSE to filter it out, as the beamforming, ANR and VAD algorithms are more sensitive in low SNR conditions.
We also analyzed the proposed system in terms of WRR at different SNR levels under stationary and non-stationary noises from the AURORA and LibriSpeech databases. The WRR results show that the proposed MCSE could function adequately in low and high SNR conditions under stationary and non-stationary noisy environments. However, there is still a need to make the DWT-CNN-MCSE a robust, noise-free system.
The proposed framework in MCSE, which included a pre-processing algorithm based on DWT and a deep learning algorithm based on CNN, outperformed the benchmark algorithms in detection in noisy environments, especially at low SNR conditions, in terms of word recognition rate accuracy. We noticed that the WRR of the benchmark MCSE was better than that of the proposed MCSE only at 15 dB SNR. This shows that the proposed MCSE is sensitive at the 15 dB SNR level under non-stationary environments. We also noticed that the proposed MCSE has a WRR that is twice as high as that of the benchmark MCSE at the −10 dB, −5 dB, 0 dB, and 5 dB SNR levels under both stationary and non-stationary environments.

CONCLUSION
Multi-channel speech activity-related devices are commonly used in various real-time applications, and the communication or speech quality of these devices is degraded by various types of noise. Thus, improving the quality of speech signals is important for these multi-channel devices. To deal with various environmental noises, we propose an MCSE using deep learning and preprocessing algorithms, and examine the performance of the proposed MCSE system in filtering environmental noises at low to high SNR. A new architecture is presented, which combines the wavelet transform, deep learning (CNN), and a BiLSTM model to learn the data pattern, and is trained to obtain the filtered signals. The proposed system shows considerable performance when compared to related studies. In handling environmental noises at various SNRs, it achieved a WRR of 70.55% at −10 dB SNR and 75.44% at 15 dB SNR, compared to 5.82% at −10 dB and 88.8% at 15 dB for the existing MCSE system. It can be inferred from the comparison that the proposed system outperformed the benchmark MCSE system at low SNR. The ANOVA analysis indicated that the proposed MCSE's scores are significantly different from those of the existing MCSE system. The word recognition accuracy reaches an acceptable rate at low SNR.

FUTURE DIRECTIONS
Our current work focuses on deep learning algorithms, where the proposed system outperforms the benchmark system. However, one of the limitations of deep learning-based algorithms is their high computational cost. As such, implementing deep learning-based approaches in portable communication devices may be difficult due to the low computing power of these devices. In this regard, one possible way forward is to combine the existing filters with deep learning approaches to enhance the speech quality and intelligibility of the output. We plan to conduct more investigations into wider types of noise and more effective speech enhancement algorithms to improve the performance of the multi-channel speech enhancement system.

Algorithm 2: Processing DWT output signals through CNN-BLSTM Algorithm
Input: speech signals, deep learning parameters (batch size, feature dimension, classes, train-test ratio).
dt
• Decompose signal into LL, HL, LH, and HH bands by computing the wavelet coefficients as $c_{jk} = W_{\psi} f\left(2^{-j}, k2^{-j}\right)$
Step 4: Input these coefficients to deep learning
• Process through convolutional layers: $n_{out} = \frac{n_{in} + 2p - k}{s} + 1$, where $n_{in}$ denotes the input attributes, $n_{out}$ denotes the output features, $k$ is the convolution kernel size, $p$ is the padding size, and $s$ is the stride
• Process the convolved data through the pooling layer: $h^{l}_{xy} = \max_{i=0,\ldots,s,\; j=0,\ldots,s} h^{l-1}_{(x+i)(y+j)}$
• Perform linearization by applying a linear layer
• Apply the BiLSTM layer
• Process the memory unit data through the fully connected layer: $z^{l} = W^{l} h^{l-1}$
• Softmax layer: $\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$
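The layer-level formulas in Algorithm 2 can be illustrated with a short Python sketch. This is a sketch under stated assumptions, not the trained network: the function names are hypothetical, the output-size formula is rounded down with integer division (an assumption for strides that do not divide evenly), the pooling windows are taken as non-overlapping, and softmax is used in its standard form.

```python
import math


def conv_output_size(n_in, k, p=0, s=1):
    """n_out = (n_in + 2p - k) / s + 1, rounded down (assumed)."""
    return (n_in + 2 * p - k) // s + 1


def max_pool_2d(feature_map, size):
    """Max pooling over non-overlapping size x size windows (assumed layout)."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Largest activation inside the window anchored at (r*size, c*size).
            row.append(max(feature_map[r * size + i][c * size + j]
                           for i in range(size) for j in range(size)))
        pooled.append(row)
    return pooled


def softmax(z):
    """Standard softmax; subtracting max(z) avoids overflow in exp."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical sizes: a 3-wide kernel with padding 1 and stride 1
# keeps a 32-sample input at 32 outputs.
same_size = conv_output_size(32, k=3, p=1, s=1)
pooled = max_pool_2d([[1, 2, 3, 4],
                      [5, 6, 7, 8],
                      [9, 10, 11, 12],
                      [13, 14, 15, 16]], size=2)
probabilities = softmax([0.0, 0.0])
```

The softmax outputs sum to one, so they can be read as class probabilities at the final layer.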

Figure 4
Figure 4 shows the spectrograms of clean speech and Table 4 shows and compares the spectrograms of a sample utterance under six different conditions from the AURORA database, which include the spectrogram analysis: (a) noisy speech at various SNR levels, (b) enhanced speech applying the benchmark BAV-MCSE, and (c) enhanced speech using

Table 2 Experimental design.
for low SNR levels, this research proposed deep learning-based algorithms to improve the recognition accuracy of the MCSE system. The experimental design of the proposed DWT-CNN-MCSE system is shown in Table 2. The experiment adds noise to the original signals at levels of −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, and 15 dB before processing them through the considered MCSE system.
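Adding noise at a prescribed SNR can be sketched in Python as follows. This is a minimal sketch, assuming that SNR is defined as the ratio of average clean-speech power to average noise power in dB and that the noise is rescaled to reach the target. The function names and the toy sample values are hypothetical and do not come from the AURORA or LibriSpeech recordings.

```python
import math


def average_power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(clean, noise, target_snr_db):
    """Scale the noise so that clean + noise has the requested SNR in dB."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = average_power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10*log10(P_clean / P_noise)  =>  required noise power:
    wanted_noise_power = average_power(clean) / (10 ** (target_snr_db / 10))
    gain = math.sqrt(wanted_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR of a noisy signal relative to its known clean reference."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(average_power(clean) / average_power(residual))


# Toy signals (placeholders, not speech data).
clean = [1.0, -1.0] * 50
noise = [0.5, 0.25, -0.5, -0.25] * 25
mixtures = {snr: mix_at_snr(clean, noise, snr)
            for snr in (-10, -5, 0, 5, 10, 15)}
```

Each entry in `mixtures` holds the same clean signal corrupted at one of the six SNR levels listed above, which can be checked with `measured_snr_db`.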

Cherukuru and Mustafa (2024), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.1901
and three women, reciting 30 sentences from the IEEE sentence database. The University of Texas at Dallas' Speech Processing Lab used Tucker Davis Technology (TDT) to capture each speaker's five words at a sampling frequency of 25 kHz, which was later downsampled to 8 kHz. Every sentence is accompanied by a variety of background noises, including those from an airport, a restaurant, a car, an exhibition, and AWGN. To get both clean and noisy signals, this research employed intermediate reference system (IRS) filters

Table 8 WRR performance of proposed WSE under non-stationary noises.
We noticed that, among all three noises, the highest WRR of 74.38% is obtained at 0 dB under dog bark noise and the lowest WRR of 61.25% is obtained at −5 dB under fan noise. In this study, we carried out spectrogram analysis and WRR evaluation on MCSE systems. The spectrograms of noisy speech signals and enhanced speech signals for different noises at −5 dB SNR from the AURORA dataset are presented for both the benchmark and the proposed MCSE systems. Compared to the benchmark system, the enhanced spectrograms of the proposed system show clear signals and are closer to the clean speech signals under stationary and non-stationary environmental noise. Similarly, for the LibriSpeech dataset, spectrograms of noisy speech signals and enhanced speech signals were analyzed, and the proposed MCSE showed a clear signal that is closer to the clean signal under ambulance noise. We noticed that the spectrograms of the speech signals enhanced by DWT-CNN-MCSE show significant noise reduction compared with the raw noisy speech, and the DWT-CNN algorithms filtered the most noise throughout the whole signal spectrum.
DISCUSSION